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Abstract: Recent work has shown that expression level is the main pre- 
dictor of a gene's evolutionary rate, and that more highly expressed genes 
evolve slower. A possible explanation for this observation is selection for 
proteins which fold properly despite mistranslation, in short selection for 
translational robustness. Translational robustness leads to the somewhat 
paradoxical prediction that highly expressed genes are extremely tolerant to 
missense substitutions but nevertheless evolve very slowly. Here, we study 
a simple theoretical model of translational robustness that allows us to gain 
analytic insight into how this paradoxical behavior arises. 
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INTRODUCTION 



The increasing availability of whole-genome sequences from many differ- 
ent species has revealed a surprising fact: Different genes within the same 
organism evolve at dramatically different rates. For example, the evolu- 
tionary rates of the fastest- and slowest-evolving genes in Saccharomyces 



cerev isiae are separated by three orders of magnitude (D rummond et al. 



20051 ). Because the dominant force shaping genome- wide patterns of evo- 
lutionary rate is most likely stabilizing selection, the evolutionary rates of 
genes should correlate wit h quantities that measure ho w important or other- 



wise constrained a gene is (|Kimura 



1983; 



Ohta 



19921 ) . A wide array of such 



quantities have been proposed, shown to correlate with evolutionary rate, and 
subseq uently disputed. Examples include a gene's dispensability or essen- 



tiality (IHurst and Sm ith 



3; 



1999: 



HlRSH and Fraser 



hJ 200.4 I 



2001 



Jordan et al. 



2002: 



Pal et al 



200. 



Zhang and 



ber of interaction partners (IF RASER et a[ 



Jordan et al 



2003; 



Iahn 



(IMarais and Duret 



2001 



work (HAHN and KERN 



the expression level ()Pal et al 



et al 



2004; 



Wall et al 



2002; 



200 51), its num 



Bloom and Adami 



Agrafioti et al 



2QqJ, its length 



2003; 



), or its centrality in the protein interaction net- 



2005V However, it seems that most imp ortantly 



2001 



ROCHA and DANC HIN 



haps more accurately the frequency of translation events ()Drummond et al 



2004|h or oer- 



2005 



200 6f), influence evolut ionary rate. 



Drummond et al. 



(2005J) have recently introduced a theory for why highly 
expressed genes evolve slowly Translation is error prone, and inactivated 
or misfolde d proteins resulting from m istranslation impose substantial costs 



on the cell ( Bucciantini et al 



2002), costs which increase with expression 



level. One way in which the cost associated with a hig hly expressed gene is 



reduced is translational accuracy ()AkashiII1994L 120011 ). whereby the gene is 
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encoded with optimal codons whose translation is less error-prone than the 
translation of other codons. Translational accuracy can explain why the rate 
of synonymous evolution dS is correlated with expression level, codon adapta- 
tion index, or protein abundance. However, it cannot explain why the rate of 
non-synon ymous evolution dN show s an even stronger correlation with these 
quantities ( Drummond et a/l2005 ). Selection for tra nslational accuracy can 
reduc e the translational error rate by a factor of 4-9 (Pr ecup and Parker 



19871 ). but even optimally coded genes that are highly exp ressed may produce 



a larg e amount of erroneous polypeptides. Therefore, IDrummond et al. 
(2005) suggest that highly expressed genes should be under additional selec- 
tion to be tolerant to translation errors, that is, the polypeptides produced 
from these genes should fold properly even if they were erroneously trans- 
lated. Recent work on protein biochemistry has shown that proteins can 
differ widely in their tolerance to missense substitutions, and that properly 
chosen point mutations c an dramatically incr e ase the tolerance of a protein to 



additional substitutions rtBLOOM et al. 



2005). 



Drummond et al. 



s hypothe- 



sis, referred to as selection for translational robustness, predicts a constraint 
on the non-synonymous rate of evolution, whereas selection for translational 
accuracy predicts primarily a constraint on the synonymous rate of evolu- 
tion. We must assume that both selective constraints will operate on genes 
that are frequently translated. 

Genes that are highly tolerant to translational missense errors must also, 
by definition, be highly tolerant to missense substitutions. However, the 
translational robustness hypothesis predicts that these genes will nevertheless 
be strongly conserved under evolution. An example of a gene that exhibits 
this paradoxical behavior is Rubisco, an extremely abundant protein which 
fixes carbon dioxide during photosynthesis. Rubisco is strongly conserved 
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across phyla, but appears to t olerate most miss e nse substitutions in the 



labor atory without loss of fold (|SpreitzerI Il993 : 



Kellogg and Juliano 



19971) . 

The purpose of the present paper is to put the translational robustness 
hypothesis into precise mathematical terms, and to demonstrate how highly 
expressed genes can evolve to be tolerant to missense substitutions and yet 
remain strongly conserved under evolution. 

MATERIALS AND METHODS 

Model: We consider the evolution of a gene encoding a protein of length L. 
Each site in the protein can be in one of two states, neutral or non-neutral. 
We denote the number of neutral sites in the protein by n. A mutation at a 
neutral site of a folded protein leaves the protein folded, but changes the site 
from neutral to non-neutral. A mutation at a non-neutral site of a folded 
protein will usually cause misfolding and consequent loss of function, but 
with a small probability a, such a mutation will not affect folding but turn 
the site into a neutral one. For simplicity, we assume that once an amino- 
acid sequence loses the ability to fold, it is impossible to mutate it back into 
a folded state. The rationale behind this assumption is that the likelihood 
of a mutation restoring fold to an unfolded amino-acid sequence is so low as 
to be negligible. Our model is a reasonable abstraction of a thermodynamic 



predictive power fo r both simulated and real proteins (jBLOOM et al 



Wilke et al. 



shown to have 


good 


Bloom et al 


2005; 



2005). The key insight of this framework is that a protein's 
tolerance to substitutions is closely related to the protein's stability — more 
stable proteins can withstand more missense substitutions — and that there- 
fore proteins can change from being highly fragile to being highly robust to 
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mutations and vice versa through the accumulation of stabilizing or destabi- 
lizing mutations. In this sense, a mutation from a non-neutral to a neutral 
site in our model corresponds to a stabilizing mutation, and the opposite 
mutation corresponds to a destabilizing mutation. Thus, our model captures 
the following key aspects of protein biochemistry: (i) Homologous proteins 
can vary widely in their tolerance to mutations, and individual point muta- 
tions can increase or decrease this tolerance, (ii) Mutations that increase a 
protein's tolerance are much rarer than mutations that decrease its tolerance, 
(iii) Highly tolerant proteins are extremely rare, while moderately tolerant 
proteins are abundant, (iv) Non-functional mutant proteins are likely to be 
misfolded. 

The gene is expressed at a level that leads to the synthesis of x polypep- 
tides. For simplicity, we assume that the total number of polypeptides trans- 
lated per gene is proportional to the gene's expression level, and that the 
constant of proportionality is 1. Thus, x is also the expression level, mea- 
sured in mRNA molecules per cell. Under translation, each site is mistrans- 
lated with probability r (we neglect premature termination of the translation 
process) . The probability that a single mRNA molecule is mistranslated and 
leads to a misfolded protein is 1 — [1 — r(l — a)] L ~ n « t(L — n), where 
the approximation holds for a,r <C 1. Let / = t(L — n) be the fraction 
of synthesized proteins that misfold. We assume that the expression level is 
regulated such that the number of folded proteins per gene, the protein abun- 
dance A, is held constant, regardless of the translational error rate. Then, 
A = x(l — f). The total number of misfolded polypeptides per gene follows 
as xf = Af/(1 — f). Finally, we assume that each misfolded polypeptide 
imposes a cost c on the cell, so that the total cost of a gene translated at 
abundance level A is cAf/(l — f). We turn this cost into a fitness value by 
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assuming that each misfolded protein has the same relative effect on fitness. 
Then we can write the overall fitness of a gene with n neutral sites as 

r(L-n) 



UK, 



exp I — cA- 



l-r(L-n) 

Without loss of generality, we use c = 1 henceforth. 



(1) 



Simulations: We implemented a stochastic simulation of N sequences re- 
producing in discrete, non-overlapping generations. We employed standard 
Wright-Fisher sampling, that is, the probability that a sequence in genera- 
tion t+ 1 is the offspring of a sequence at generation t is directly proportional 
to the latter's fitness. 

We measured the evolutionary rate along the line of descent from the 
most recent common anc estor (MRCA ) of the final population backwards 



in time, as described bv IWilkeI 1)20041 ). Briefly, we let the simulated popu- 
lation evolve until the birth-time of the population's MRCA, designated to, 
exceeded a fixed equilibration time t equ n plus a time window t mea s, *o > ^equii + 
^meas- All quantities were measured on the equilibrated population during this 



latter time window. For all results reported, we chose t. 



equil 



400000, 



U = 0.001, r = 0.001, and L = 100. At all parameter settings, we carried 
out 100 replicas and averaged results over all replicas. 

ANALYTICAL RESULTS 

Solution based on Sella-Hirsh theory: We can calculate the steady-state 
solution of our model using the analogy be tween evolutionary biolog y and 
statistical physics recently demonst rated by 



Sella and H irsh 



(2005). The 



theory of ISella and Hirshi (j2005J) is applicable whenever the product of 
population size and mutation rate is much smaller than one, NU < 1. In 
this regime, the population is essentially homogeneous at all times, and can 
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be represented at any given point in time by a single sequence. We say the 
population is in state i if the dominant sequence in the population is sequence 



i. The key insight of ISella and Hirshi (|2005l ) is that the probability pi to 



find the population in state % is proportional to a function F(i) (also called 
a Boltzmann factor) that depends only on the fitness of sequence i, the 
population size, and details of the mutation process. Thus, it follows that 

Pi = F(i)/Y,F(j), (2) 

3 

where the sum runs over all possible sequences j. Once we have the prob- 
abilities Pi, we can calculate all observable quantities of interest, such as 
expected fitness and expected evolutionary rate, using standard probability 
theory (see also below). 

As the fitness of a sequence in our model depends only on the sequence's 
number of neutral sites n, it is useful to lump all sequences with the same 
n into a single class, and calculate the probability p n that the population 
is in a state represented by any sequence with n neutral sites. Since there 
are f) such sequences, we introduce F'(n) = (^)F(n), where F(n) is the 
appropriate Boltzmann factor for a single sequence with n neutral sites, and 
then calculate p n as p n = F'(n)/ Y^k=o^'(^)- m our case > F'(n) is given by 

F'(n) = (^) exp[(2iV - 2) ln(w n ) + n ln(a)] (3) 

with w n as defined in Eq. (Q). The second term in the exponential takes 
into account the asymmetry in the mutation process, that is, mutations that 
i ncrease n are a factor a less likely to occur than mutations that decrease n 



(Se lla and Hi rsh 



2005L supplementary text). 



With the formalism outlined in the previous paragraphs, we can calculate 



8 



the expected number of neutral sites in the steady state E[n] as 



E[n] = ^nF'{n)/^F'{k) 



(4) 



71=0 



fc=0 



and the expected fitness as 



EM =£>„F(n)/ (5) 

n=0 k=0 

Note that the expected values are not taken over the population (which is 
assumed to be homogeneous), but over a long time- window in the steady 
state. Next we can calculate the evolutionary rate K, that is, the expected 
number of amino-acid substitutions per unit time that accumulate along the 
line of descent in an equilibrated population. We find 

L L 

K = NU[a7r(n ^n + l) + n(n^n- l)]F'(n) / ^ F'(k) , (6) 

n=0 fc=0 

where 7r(i — > ?') is the pr obability of fixation of sequence j in background i 



( Se lla and Hi rsh 



2005|) 



-2 ln(wj /vii) 



(7) 



[Note that n(L -> L + 1) := and tt(0 -> -1) := 0.] 

We can simplify these expressions in the special cases that A is either 
very large or very small. After inserting Eq. (JIJ into Eq. (|5J) . we have 



r? 



exp 



-■IX A . r(L . n) : +nln(a) 



(8) 



1 -r(L-n 

where we have also made the approximation iV — 1 w iV. We will continue 
to use this approximation throughout the rest of the paper. From Eq. (JHJ), 
we can see that the behavior of the system changes drastically depending on 
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whether the product NA is large or small. However, since the population size 
N is the same for all genes in a species while each gene's corresponding protein 
abundance A can vary over many orders of magnitude, in the following we 
assume that N is fixed and consider the limits of very small and very large 
A. 

For A — > 0, the first term in the exponential disappears, and F'(n) be- 
comes simply (^)« n - Thus, we find 

n=0 v 7 n=0 v 7 

= aL/(a + l). (9) 

We cannot obtain similarly simple expressions for E[u>] and K in this limit, 
but will do so in the next subsection using a different method. 

For A —> oo, we have to distinguish between the cases n = L and n < L. 
For n < L, the first term in the exponential in Eq. (jHj) becomes much larger in 
magnitude than the second term, which is a constant. Thus, we can neglect 
the second term, and have 

F'(n) = ( L ) exp [ - 2NA t{L n) } . (10) 
\n J L 1 — t(L — njJ 

This expression tends to for large A. For n = L, we have F'(L) = a L , 
independent of A. All terms but the n = L term disappear, and we have 
E[n] = L, E[w] = wl, and 

K = NUn(L -> L - 1) 

= NUe-wArni-T) (n) 

Approximate solution: Sella-Hirsh theory yields an exact solution for our 
model. However, the resulting expressions are somewhat unwieldy, and don't 
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lead to simple analytical expressions for intermediate A. Therefore, we will 
now derive an approximate solution to our model. 

For small r, we can approximate Wi ~ exp[— At(L— i)], and find ln(wj/wi) - 
Ar(j-i). This approximation is equivalent to neglecting the small number of 
additional translation events required to replace polypeptides that misfold. 
The probability of fixation follows from Eq. (J7|) as 

1 _ e -2Ar(j-i) 

<i^j)= I _ e -2NArU-i) ■ ( 12 ) 

The idea of the approximate solution is that in the steady state, mutations 
that increase the number of neutral sites and those that decrease it are in 
perfect balance. Therefore, the number of neutral sites in the steady state, 
n*, satisfies: 

a{L - n*)n(n* ->• n* + 1) = n*7i(n* -»• n* - 1) . (13) 

According to Eq. (|12p. 7i(n — > n + 1) and 7r(n — > n — 1) are independent of 
n. We introduce 7r + and 7r_, the probabilities of fixation for a mutation that 
increases or decreases n by one, respectively, and find: 

1 - e~ 2Ar 

n + = I _ e -2NAr ' ( 14 ) 

1 - e 2Ar 

n - = 1 _ e 2NAr ■ ( 15 ) 

After inserting these expressions into Eq. EI] and solving for n*, we obtain 

aLn+ 

n = . (16) 

an + + 7r_ 

With this result, the expected fitness in the steady state becomes 

E[w] = exp[-Ar(L - n)\ , (17) 
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and the evolutionary rate K follows as 

K = Nu(a7r + ^^- + vr_^) . (18) 

In the Appendix, we derive an expression for K as a function of n*, rather 
than as a function of A as we have done here. 

In the limit A — > 0, we have 7r + = 7r_ = 1/N and n* = aL/(a + 1) . 
[Note that this expression is identical to the result found through Sella-Hirsh 
theory, Eq. ©, if we equate n* with E[n].] Therefore, in this limit, 

K = aU. (19) 

In the limit A — > oo, we have 7r+ = 1, 7r_ = e _27V " 4r , and n* = L . Therefore, 
in this limit, 

K = NUe~ 2NAr . (20) 

The expressions for n* and K again agree with the results found through 
Sella-Hirsh theory, if we assume 1 — r w 1 in Eq. (fTTj) . 

Limitations on the number of neutral sites: Certain residues may never 
tolerate any substitutions, such as the active-site serine of a serine protease 
or the heme-binding histidines of hemoglobin. Under the assumption that 
m sites can never be neutral, we can write L = I + m, and for small r the 
fitness w n is approximately 

Wn = e -AT(l+m-n) = e -Arm e -Ar{l-n)^ ^ 

In other words, all fitness values are rescaled by a factor depending on r but 
not on n. Eq. (JZJ) reveals that such a rescaling leaves the fixation probabilities 
unchanged. Therefore, the approximate solution remains unchanged except 
that we replace L by / (= L — m) everywhere and add a factor e~ Arm to 
Eq. (PU). 

12 



For Sella-Hirsh theory, if we assume that rm is negligibly small, the 
Boltzmann factor F'(n) gains a similar leading factor which, as Eqs. and 
(JSJ) make clear, also drops out, this time through the normalization term. In 
the case of E[n], the result is only that L must again be replaced by / = L— m, 
while the expected value E[u>] also gains a leading factor e~ Arm , just as in 
the approximate case. 

In short, when there are m never-neutral sites, the main effects are to 
lower the population's fitness by a factor e ~ Arm and to reduce the expected 
number of neutral sites E[n] and the evolutionary rate K roughly as though 
the evolving gene were shorter by m residues. 

SIMULATION RESULTS 

First, we studied the rate of evolution K as a function of the protein 
abundance A, for various population sizes (Fig. GJ. We found that K levels 
off for A — > 0. The asymptotic value at A = is K(0) = aU (Eq. ITHjl . 
For increasing A, K first increases, and then rapidly drops off to zero for 
even larger A. The main effect of the population size N is to determine at 
what level of A this drop-off initiates. With increasing N, the evolutionary 
rate K seems to be simply shifted to the left, towards lower A (Fig. Q). We 
can make this statement more precise by considering the large- A limit of our 
approximate solution, Eq. [20J This limit shows that the evolutionary rates 
K(N, A) and K(aN, a' 1 A) are related through K(N, A) » a^K^aN, a" 1 A), 
where a is an arbitrary constant. Therefore, if we increase N by a factor of 
a, the resulting curve K(A) appears on a log-log plot to be shifted upwards 
and to the left by an amount of log(a). The upwards shift cannot be noticed, 
because it exists only for very large A, and thus the effect of increasing N 
seems to be to simply shift the K(A) curve to the left. 
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Second, we studied the effect of varying a on K(A) (Fig.[5J). The variable 
a mainly influences the asymptotic limit of K(A) for small A (with K(A) 
increasing with a), but does not affect how quickly K(A) decays for large A. 

Third, we studied the behavior of the expected fitness and the expected 
number of neutral sites for varying levels of A. The expected fitness is ap- 
proximately 1 for both very low and very high A, but drops below 1 for the 
intermediate values of A for which K(A) starts to decay (Fig. E^,). The ex- 
pected number of neutral sites is olL/ (a + 1) for very small A, and increases 
to 1 for large A (Fig. Efc>)- We can understand the different evolutionary 
regimes of low A and high A as follows: For low A, since very little protein is 
synthesized, the cost associated with misfolded proteins following erroneous 
translation is negligible. Therefore, the expected fitness in this regime is 
1. Since the cost of misfolding is negligible, the number of neutral sites is 
not under selection in this regime, and it settles to the value at which the 
mutations increasing n and those decreasing n exactly balance each other. 
For high A, on the other hand, the cost of translation-induced misfolding is 
tremendous. Therefore, at high A, the population converges to the single 
optimal sequence with n = L (or n — I if some sites can never be neutral). 
Every mutation that reduces n by even 1 is highly deleterious, and therefore 
will virtually never go to fixation, even in a small population. For I = L, the 
optimal sequence (which has n = I) pays no cost whatsoever under mistrans- 
lation, and the expected fitness is again 1. For intermediate A, the cost of 
translation- induced misfolding is significant but not debilitating. As a result, 
n is elevated in comparison to its low- A limit, but the expected fitness falls 
below 1. 

Finally, the calculation in the Appendix predicts that the evolutionary 
rate should be independent of N if we plot it as a function of the expected 
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number of neutral sites E[n] rather than a function of A. Figure 0] shows that 
this prediction is indeed accurate. We find that there are two distinct regimes 
for the evolutionary rate. For small E[n], the evolutionary rate increases 
with E[n]. This increase is caused by the increased availability of neutral 
mutations with growing E[n]. However, even though we can calculate what 
the evolutionary rate would be for arbitrarily small E[n], in equilibrium E[n] 
will never be below its limiting value for small A, aLj (a + 1). For large E[n], 
the behavior of K is reversed, and now it decreases with increasing E[n]. 
The decay comes about because in this regime, even though there are many 
mutations which do not disrupt the fold of a properly translated protein, 
these mutations increase the amount of mistranslated, misfolded proteins, 
and thus are selected against. The quantity E[n] can get arbitrarily close 
to L (or I for m > 0), and therefore K can decay to almost zero if A is 
sufficiently large. 

Throughout this study, we found good agreement among the numerical 
simulations, Sella-Hirsh theory, and our simple analytical approximation. 
Some discrepancies appeared between theory and simulations for the largest 
population size (N = 1000) and for very small a in conjunction with large 
A. We attribute the former to a violation of the condition NU <C 1, which 
must be satisfied for both Sella-Hirsh theory and our approximate solution. 
We carried out our simulations with a mutation rate of U — 0.001, which 
means that NU = 1 for iV = 1000. The latter discrepancies are caused 
by insufficient equilibration time. For large A, the number of neutral sites 
n always approaches L, irrespective of population size or a. However, the 
smaller a is, the lower the probability that a mutation occurs which increases 
n. Therefore, the equlibration time needed at large A grows without bound 
as a decreases. We did additional simulations in this regime, and found that 
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the simulation results approached the predicted quantities with increasing 
equilibration time (data not shown). 



DISCUSSION 

We have developed a simple model that describes the slowdown of the 
rate of evolution of highly expressed genes under selective pressure for trans- 
lational robustness. We have studied the model with numerical simulations 
and have solved the model exactly using Sella-Hirsh theory. We have also de- 
veloped a simple analytic approximation that is in excellent agreement with 
the predictions from Sella-Hirsh theory, and is valid for the entire range of 
possible parameter values (as long as NU <Cl). 

The model abstracts a pre vious thermodynami c model of protein muta- 



tional tolerance introduced by 



Bloom et al 



(2005|) in which mutations may 



leave unperturbed or destabilize the protein's native structure (common) or 
stabilize it (uncommon). Increases in stability tend to increase the number of 
sites at which substitutions can be tolerated, so-called neutral sites, while de- 
creases in stability usually cause misfolding or decrease the number of neutral 
sites. In the present work, we have modeled neutral sites directly. In doing 
so, we only allow stepwise changes in neutral sites, sacrificing treatment of 
large stability changes that might radically alter the number of neutral sites 
and the potential stability dependence of mutational effects in order to gain 
analytical tractability. 

Our results show a clear example of the paradox cited in the Introduc- 
tion (Fig. HJ): Given selection against the costs of protein misfolding, genes 
simultaneously become more mutationally tolerant (larger number of neutral 
sites, n) but evolve slower as if fewer mutations were tolerated. The para- 
dox's resolution requires disentangling two kinds of mutational tolerance, one 
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of which captures the likelihood of loss of protein function due to a mutation 
while the other quantifies the fitness cost of that mutation, the cost which 
ultimately determines evolutionary rate. When protein misfolding imposes 
minimal fitness costs, as is the case with low-expression proteins, the propor- 
tion of mutations which preserve protein function govern the rate of evolution 
(Fig. |U left). However, at high expression levels, fitness costs of mutations 
which preserve protein function grow large and can become the dominant 
determinant of evolutionary rate (Fig. HJ right). 

This observation has an important corollary. When selection for trans- 
lational robustness is weak, functional loss is likely the main determinant 
of fitness costs. Thus, our results suggest that evolutionary conservation of 
sites in low-expression proteins may be more likely to indicate functional 
importance than similar conservation at sites in high-expression proteins. 

Our simple model produces an exponential decline in evolutionary rate 
with increasing expressi on level, whereas in yeas t, a power law better de- 
scribes this relationship ((Drummond et a/.ll2005f ). Several possibilities may 
explain the discrepancy. First, our model assumes a symmetric binomial dis- 
tribution of the number of neutral sites, but the distribution for real proteins 
may be skewed or heavy-tailed. Second, the cost of additional misfolded pro- 
teins may not be independent of the number of alread y misfolded proteins. 



For e xample, misfolded proteins form toxic aggregates ((Bucciantini et al. 



20021 ) . and aggregation is not a linear function of protein concentration. Fi- 
nally, differences in protein structure and function between high - and low- 
expression proteins may play a role. IDrummond et al\ (j2005l ) have pre- 
viously examined differences between functionally and structurally similar 
paralogs and found a similar power-law relationship. However, more subtle 



but important differences may separate paralogs and influence their evolu- 
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tion. 

Our results here demonstrate that profound differences in protein evo- 
lutionary rates can arise even in the absence of functional and structural 
differences and when variables such as protein length, the translation error 
rate, and the underlying distribution of the number of neutral sites are held 
constant. In real genomes, of course, all these features vary and some, per- 
haps all, are under selection. The value of the model is its utility in explaining 



why highly expressed proteins evolve slowly across taxa (JDrummond et a! 



2005). 



Interestingly, our model reveals two evolutionary-rate regimes (Figs. ^ 
and |2J), one in which rates remain relatively constant with increasing protein 
production, and another in which rates decline precipitously. In yeast, vir- 
tually all genes appear to fall in the latter regime, as th e evolutionary rate 



declin es almost immediately from low to high expression ( Drummond et al 



20051 ) . raising the possibility of genome-level selection in this direction. If 



yeast protein synthesis levels reflect organismal needs and cannot be freely 
modulated, as seems likely, and protein synthesis costs dominate cellular 
energy consumption, as evidence suggests ([Princiotta et a/.ll20 03). the re- 
maining genome-level target for selection is on the fitness cost per misfolded 
protein, c. Decreasing c pushes genes away from the decline to where cost 
differences become negligible, whereas increasing c pushes genes toward the 
decline, amplifying the cost difference between high- and low-expression pro- 
teins. One way to decrease c is to drive translation errors down to negligible 
levels. Another is to maintain a quality-control apparatus (e.g. chaperones 
and proteases) with so much excess bandwidth that cost differences associ- 
ated with variability in protein misfolding become negligible. Evidence sug- 
gests that both strategies for decreasing c impose significant fitness penalties. 
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Decreasing translational er ror rates can be easi ly accomplished, often with a 
single ribosomal mutation ( Alksne et al but ribosomal accuracy and 

growth rate are often negatively correlated, presumably through the intrin- 
sic sp eed /accuracy tradeoff inherent in ribosomal proofreading (JKurland 



19921 ) . Maintenance of a chaperone fleet large enough to dilute out misfolding 



cost differences would divert enormous cellular resources for little benefit, and 
the massive induction of chaperones after heat shock suggests that cellular 
chaperone levels do not have much remaining bandwidth under normal con- 
ditions. Overall, it seems plausible that the steep decline observed in yeast's 
evolutionary-rate-expression relationship reflects a balance favoring a rela- 
tively high cost per misfolded protein c. Costly translational accuracy and 
quality control machinery may be reduced so long as the increased errors and 
reduced folding assistance can be compensated for; translational robustness 
provides that compensation — essentially for free — but is ultimately limited 
by mutation pressure away from robust sequences and by the fundamental 
intolerance of proteins to at least some errors. 

Our model distinguishes between the number of polypeptides produced 
per gene, x, and the abundance of functional proteins, A, yet our approxi- 
mate solution essentially equates these quantities with only minor accuracy 
loss. The approximation works for two reasons. First, misfolded polypep- 
tides impose a negligible cost for low-abundance proteins, while for high- 
abundance proteins, misfolded polypeptides are rare because of selection 
for translational robustness. We expect these nontrivial results to hold for 
many organisms. Second, in our model, the number of translation events 
x, the primary determinant of the number of mistranslated proteins, is es- 
timated accurately by A, a situation unlikely to hold for most organisms. 
Protein abundance reflects a balance of ongoing translation and turnover 
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(Greenbaum et al 



2003), such that a high abundance can result from ei- 



ther moderate translational frequency and long protein half-life or from rapid 
translation and short half-life. Because ha lf-lives can vary over orders of 
magnitude ( Hargrove and SchmidtIIi989 ). abundance and translation fre- 
quency may only weakly correlate in real organisms. Among protein abun- 
dance, mRNA expression level, and translation frequency, we hypothesize 
that the latter, even though difficult to measure, will best predict evolution- 
ary rate; 



Burger et al 



recently studied a question closely related to the 
present paper, asking why phenotypic mutation rates (corresponding to the 
translational error rate in the present paper) are much higher than genotypic 



mutation rates. Within the framework of their model. IBUR.GER. et all (J2005) 
found very little pressure for reduction of phenotypic mutation rates below a 
certain threshold. Even though we keep the translational error rate constant 
in our model, we can consider a change in the number of neutral sites n 
as a chang e in the phenotypic m utation rate, and thus compare our results 



to those of 



Burger et al 



In contrast to their conclusions, we find 
that the fraction of neutral sites, n/L, quickly rises to the maximum possible 
for highly expressed genes, thus reducing the phenotypic mutation rate to 
zero except when some sites cannot be made neutral. We believe that the 
differences in r e sults are caused by differences in the way in which we and 
(2005) tre ated costs of erroneously translated proteins in our 



Burger et al 



models. 



Burger et al 



( 20051 ) consider the total cost of protein synthesis, 
but do not consider additional penalties imposed by misfolded proteins, not 
only for their recognition and cleanup by the quality -control system but also 
for their innate toxicity ()Bucciantini et a/.l 120021 ) . Clearly, if we neglect 
these unique costs, then the only pressure to reduce the phenotypic mutation 
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rate is to reduce the cost of synthesis for misfolded proteins, and this pressure 
will be weak if this cost is only a small proportion of the total cost of protein 
synthesis. In our model, on the other hand, we have focused exclusively 
on costs of misfolded proteins apart from their synthesis costs, implicitly 
assuming that the total cost of protein synthesis is approximately equal to the 
cost of synthesis of functional proteins, and that the benefit of the functional 
proteins will pay for their synthesis. We believe that there is indeed a strong 
selective pressure to reduce the phenotypic mutation rate for highly expressed 
genes, but that it is cheaper for cells to evolve translationally robust genes 
than to evolve highly accurate transcription and translatio n machinery. 



Can translational robustness really be obtained cheaply? 



Drummond et al 



(200a) have suggested that increased protein stability both confers muta- 
tional tolerance and constrains sequence evolution. Increasing protein sta- 
bility, that is, decreasing the free energy of folding AGf, provides a plausi- 
ble mechanism for obtaining translational robustness for numerous reasons. 
First, increased stability leads to increased mutatio nal tolerance and can be 
obtained by point mutations ()Bloom et a/.l 120051 ). Second, the stability- 
increase mechanism is sufficiently general to encompass proteins of diverse 
functions and to operate in a wide range of organisms. Third, stability is free 
in the sense that obtaining a protein with lower AG/ requires only a chance 
mutation. While many researchers have noted an apparent tradeoff between 
protein stability and enzymatic activity, it is crucial to emphasize that this 
trend may be statistical rather than intrinsic: Because both high activity 
and high stability are rare properties, mutati ons that improve b oth are ex- 
ceedingly unlikely unless both are constrained ( Giver et a/ll998 ). Selection 
for translational robustness provides precisely that dual constraint, and be- 
cause many millions of mutations may be screened over evolutionary time, 
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the very few resulting in highly expressed proteins with increased stability 
(conferring tolerance to translation errors) and uncompromised activity will 
be found. Finally, the very rarity of such stabilizing mutations provides a 
measure of the scarcity of highly stable proteins available for exploration by 
evolutionary drift. If increased stability is a dominant response to the need 
for mutational tolerance in highly expressed proteins, it will restrict drift and 
slow evolution relative to less-constrained low- expression proteins. 
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APPENDIX 

Here we derive an expression for the evolutionary rate K as a function of 
the number of neutral sites n* rather than the protein abundance A. For the 
remainder of this appendix, we drop the superscript from n* for simplicity. 
We begin by noting that Eq. (fTB|) implies 

r?,7r_ = a(L — n)n + . (22) 
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Further, note that we can write, for sufficiently large N, 

n_ = e- 2NAr n + . (23) 

Therefore, e~ 2NAr = a(L — ri)/n. We can solve this expression for A and 
find 

A=^\n[a(L-n)/n}. (24) 
After inserting Eq. (J2l"J) into the definition of 7r_, we obtain 



1 — exp ( -j 1 - \n[a(L — n)/n] 
7T_ = ) f . (25) 

1 — exp I — ln[a(L — n)/n\ 



We expand this expression for large N, replacing the exponential in the 
numerator by the first two terms of its Taylor series, and find 

1 a(L — n) , r . r . . , . , 

tt_ = — - v . J —\n[a(L-n) n}. 26 

N aL — (a + l)n " J v ' 

Now, after inserting Eqs. (}2*2*|) and (|2*6^) into Eq. (JB]l. we obtain for the ex- 
pected evolutionary rate 

K = 2U ™ { f~ n L ln[a(L " n),n] ■ (27) 
aL z — [a + l)Ln 

Note that this expression is independent of the population size N. Even 
though we have derived it under the assumption that iV is large, we find that 
it works very well even for moderate population sizes of 100 or less. 
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Protein abundance A 

Figure 1: Evolutionary rate K (measured in substitutions per 4 x 10 5 
generations) versus the protein abundance A, for population sizes N = 
100,300, 1000 (a = 0.1). Data points indicate simulation results, solid lines 
indicate the prediction using Sella-Hirsh theory, Eq. (JHJ), and dashed lines 
indicate the prediction using our approximate solution, Eq. f|18|) . Error bars 
indicate standard errors. 
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Figure 2: Evolutionary rate K (measured in substitutions per 4 x 10 5 gen- 
erations) versus the protein abundance A, for values of a = 0.03,0.1,0.3 
(N = 100). Data points indicate simulation results, solid lines indicate the 
prediction using Sella-Hirsh theory, Eq. ©, and dashed lines indicate the 
prediction using our approximate solution, Eq. (fTHj) . Error bars indicate 
standard errors. The deviation from the prediction of the two rightmost 
data points for a = 0.03 is caused by insufficient equilibration time. 
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Figure 3: Expected fitness E[w] (a) and expected number of neutral sites E[n] 
(b) versus the protein abundance A, for iV = 100 and a = 0.1. Data points 
indicate simulation results, solid lines indicate the prediction using Sella- 
Hirsh theory, and dashed lines indicate the prediction using our approximate 
solution. Error bars indicate standard errors. 
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Expected number of neutral sites E[n] 

Figure 4: Evolutionary rate versus expected number of neutral sites E[n], 
for population sizes iV = 100, 300, 1000 (a = 0.1). The solid line stems from 
Eq. ()27|) in the Appendix. Error bars indicate standard errors. 
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a 


probability ot turning a non-neutral into a neutral site 


A 


number (abundance) ol tolded proteins 


f 


P i • P P 1 1 1 i ■ 

traction ot tolded proteins 


t 


Boltzmann tactor 


K 


number of amino-acid substitutions since most 
recent common ancestor (evolutionary rate) 


I 


number of possibly-neutral sites 


L 


protein length 


m 


number of never-neutral sites 


n 


number of neutral sites 


N 


population size 


t 


time, in generations 


r 


translation error probability per site 


U 


mutation rate per sequence 


w n 


fitness of sequence with n neutral sites 



Table 1: Variables used in this work. 
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